- Subject: How to syntax-colour Inform
- Date: Wed, 17 Dec 1997 18:46:34 +0000 (GMT)
- From: Graham Nelson <graham@gnelson.demon.co.uk>
- Newsgroups: rec.arts.int-fiction
-
- [This is going to be a new section in the Inform Technical Manual,
- which seems as good a place to keep it as any, but in the mean time
- it's been requested several times on the newsgroup, hence this
- posting. Comments welcome -- GN.]
-
- How to syntax-colour Inform source code
- ---------------------------------------
-
- "Syntax colouring" is an automatic process which some text editors apply
- to the text being edited: the characters are displayed just as they are,
- but with artificial colours added according to what the text editor thinks
- they mean. The editor is in the position of someone going through a book
- colouring all the verbs in red and all the nouns in green: it can only do
- so if it understands how to tell a verb or a noun from other words.
- Many good text editors have been programmed to syntax colour for languages
- such as C, and a few will allow users to reprogram them to other languages.
-
- One such is the popular Acorn RISC OS text editor "Zap", for which the
- author has written an extension mode called "ZapInform". ZapInform
- contributes colouring rules for the Inform language. Over a dozen people
- have now asked me how it works, and since the original is written in ARM
- assembly (a language rather less widely spoken than Middle Egyptian),
- it seems worth documenting the main algorithm.
-
- (ZapInform does a number of other useful things, including pasting in
- template objects and rooms when commanded from a mouse-accessed menu:
- for instance, you can create a simple game with two or three mouse
- clicks and a few object names typed in to a dialogue box, then click
- to save and compile the result. See the ZapInform manual for details.)
-
- (a) State values
-
- ZapInform associates a 32-bit number called the "state" with every
- character position.
-
- The "state" is as follows. 11 of the upper 16 bits hold flags, the
- rest being unused:
-
- 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17
-
-     comment
-     single-quoted text
-     double-quoted text
-     statement
-     after marker
-     highlight flag
-     highlight all flag
-     colour backtrack
-     after-restart-flag
-     wait-direct (waiting for a directive)
-     dont-know-flag
-
- These flags make up the "outer state", while the lower 16 bits hold
- a number pompously called the "inner state":
-
-     0        after WS (WS = white space or start of line or comma)
-     1        after WS then "-"
-     2        after WS then "-" and ">" [terminal]
-     3        after WS then "*" [terminal]
-
-     0xFF     after junk
-     0x100*N + S   after WS then an Inform identifier N characters long,
-                   itself in state S:
-                       0x101 w    0x202 wi   0x303 wit   0x404 with
-                       0x111 h    0x212 ha   0x313 has
-                       0x121 c    0x222 cl   0x323 cla   0x424 clas   0x525 class
-     same + 0x8000 when complete [terminal]
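As a rough illustration of this layout (not ZapInform's own code, which is ARM assembly), the following Python sketch splits a state into its two halves and decodes an inner state. The function names are hypothetical. The only flag value taken from the text is wait-direct, since the initial state 0x02000000 is described as having just that flag set. The length field is read from the examples in the table, where 0x404 corresponds to the four letters of "with".

```python
# WAIT_DIRECT's value follows from the documented initial state 0x02000000;
# split_state and describe_inner are hypothetical helper names.
WAIT_DIRECT = 0x02000000

def split_state(state):
    """Return (outer, inner): the flag bits and the low 16-bit inner state."""
    return state & 0xFFFF0000, state & 0x0000FFFF

def describe_inner(inner):
    """Decode an inner state into a small dictionary."""
    if inner & 0x7FFF < 0x100:
        # States 0, 1, 2, 3 and 0xFF carry no identifier information.
        return {"kind": "simple", "value": inner}
    return {
        "kind": "identifier",
        "length": (inner >> 8) & 0x7F,     # the N in 0x100*N + S
        "substate": inner & 0xFF,          # the S tracking with/has/class
        "complete": bool(inner & 0x8000),  # 0x8000 marks a terminal
    }
```

For example, describe_inner(0x8525) reports a completed five-letter identifier whose sub-state 0x25 is the one reached by "class".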
-
- In practice it would be madness to try to actually store the state
- of every character position in memory (it would occupy four times as
- much space as the file itself). Instead, ZapInform caches just one
- state value, the one most recently calculated, and uses a process
- called "scanning" to determine new states. That is, given that we
- know the state at character X and want to know the state at character
- Y, we can find out by scanning each character between X and Y,
- altering the state according to each one.
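The single-value cache described above might be sketched in Python as follows. The class name, its methods and the policy of restarting from the top of the file when asked for an earlier position are all assumptions made for illustration; the text only says that one state is cached and that new states are found by scanning forward.

```python
class StateCache:
    """Remembers one (position, state) pair and scans forward from it."""

    def __init__(self, text, scan, initial_state=0x02000000):
        self.text = text
        self.scan = scan                  # scan(state, character) -> new state
        self.initial_state = initial_state
        self.pos = 0                      # number of characters already scanned
        self.state = initial_state

    def state_at(self, target):
        """Return the state after the first `target` characters."""
        target = min(target, len(self.text))
        if target < self.pos:
            # Assumption: going backwards means rescanning from the start.
            self.pos, self.state = 0, self.initial_state
        for ch in self.text[self.pos:target]:
            self.state = self.scan(self.state, ch)
        self.pos = target
        return self.state
```

Asking for positions in increasing order, as happens when a screen is redrawn from top to bottom, touches each character only once.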
-
- It might possibly save some time to cache more state values than
- this (say, the state values at the start of every screen-visible
- line of text, or some such) but the complexity of doing this doesn't
- seem worthwhile in my implementation. Scanning is a quick process
- because the Zap text editor stores the entire file in almost contiguous
- memory, easy to run through, and the state value can be kept in a
- single CPU register while this is done.
-
- (b) Scanning text
-
- Let us number the characters in a file 1, 2, 3, ...
-
- The state before character 1 is always 0x02000000: that is, inner
- state zero and outer state with only the waiting-for-directive flag set.
- (One can think of this as the state of an imaginary "character 0".)
- The state at character N+1 is then a function of the state at
- character N and what character is actually there. Thus,
-
- State(0) = 0x02000000
-
- and for all N >= 0,
-
- State(N+1) = Scanning_function(State(N), Character(N+1))
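This recurrence is a left fold over the characters of the file. A minimal Python sketch, in which the name all_states is hypothetical and the scanning function is passed in as a parameter, could use itertools.accumulate to list every State(N):

```python
from itertools import accumulate

def all_states(text, scanning_function, initial=0x02000000):
    """Return [State(0), State(1), ..., State(len(text))]."""
    # accumulate calls scanning_function(previous_state, character)
    # for each character in turn, starting from `initial`.
    return list(accumulate(text, scanning_function, initial=initial))
```

A real editor would not keep the whole list, of course; it only needs the last value, which is what the cached-state approach relies on.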
-
- And here is what the scanning function does:
-
- 1. Is the comment bit set?
-       Is the character a new-line?
-          If so, clear the comment bit.
-       Stop.
-
- 2. Is the double-quote bit set?
-       Is the character a double-quote?
-          If so, clear the double-quote bit.
-       Stop.
-
- 3. Is the single-quote bit set?
-       Is the character a single-quote?
-          If so, clear the single-quote bit.
-       Stop.
-
- 4. Is the character a single quote?
-       If so, set the single-quote bit and stop.
-
- 5. Is the character a double quote?
-       If so, set the double-quote bit and stop.
-
- 6. Is the character an exclamation mark?
-       If so, set the comment bit and stop.
-
- 7. Is the statement bit set?
-       If so:
-          Is the character "]"?
-             If so:
-                Clear the statement bit.
-                Stop.
-
-          If the after-restart bit is clear, stop.
-
-          Run the inner finite state machine.
-
-          If it results in a keyword terminal (that is, a terminal
-          which has inner state 0x100 or above):
-             Set colour-backtrack (and record the backtrack colour
-             as "function" colour).
-             Clear after-restart.
-
-          Stop.
-
-       If not:
-          Is the character "["?
-             If so:
-                Set the statement bit.
-                If the after-marker bit is set, set after-restart.
-                Stop.
-
-          Run the inner finite state machine.
-
-          If it results in a terminal:
-             Is the inner state 2 [after "->"] or 3 [after "*"]?
-                If so:
-                   Set after-marker.
-                   Set colour-backtrack (and record the backtrack
-                   colour as "directive" colour).
-                   Zero the inner state.
-             [If not, the terminal must be from a keyword.]
-             Is the inner state 0x404 [after "with"]?
-                If so:
-                   Set colour-backtrack (and record the backtrack
-                   colour as "directive" colour).
-                   Set after-marker.
-                   Set highlight.
-                   Clear highlight-all.
-             Is the inner state 0x313 ["has"] or 0x525 ["class"]?
-                If so:
-                   Set colour-backtrack (and record the backtrack
-                   colour as "directive" colour).
-                   Set after-marker.
-                   Clear highlight.
-                   Set highlight-all.
-             If the inner state isn't one of these: [so that recent
-             text has formed some alphanumeric token which might or
-             might not be a reserved word of some kind]
-                If waiting for directive is set:
-                   Set colour-backtrack (and record the backtrack
-                   colour as "directive" colour)
-                If not, but highlight-all is set:
-                   Set colour-backtrack (and record the backtrack
-                   colour as "property" colour)
-                If not, but highlight is set:
-                   Clear highlight.
-                   Set colour-backtrack (and record the backtrack
-                   colour as "property" colour).
-
-          Is the character ";"?
-             If so:
-                Set wait-direct.
-                Clear after-marker.
-                Clear after-restart.
-                Clear highlight.
-                Clear highlight-all.
-          Is the character ","?
-             If so:
-                Set after-marker.
-                Set highlight.
-
-          Stop.
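Steps 1 to 6 of the scanning function, which deal only with comments and quoted text, can be sketched in Python as below. The three bit values are hypothetical choices made for this sketch, and the function returns an extra "handled" flag (also an invention of the sketch) so that a caller would know when step 7 still needs to run. Step 7 itself is left out.

```python
# Hypothetical bit values: the text names these flags, but the positions
# used here are chosen only for the sake of the example.
COMMENT      = 0x80000000
SINGLE_QUOTE = 0x40000000
DOUBLE_QUOTE = 0x20000000

def scan_quotes_and_comments(state, ch):
    """Apply steps 1 to 6; return (new_state, handled)."""
    if state & COMMENT:                      # step 1
        if ch == "\n":
            state &= ~COMMENT
        return state, True
    if state & DOUBLE_QUOTE:                 # step 2
        if ch == '"':
            state &= ~DOUBLE_QUOTE
        return state, True
    if state & SINGLE_QUOTE:                 # step 3
        if ch == "'":
            state &= ~SINGLE_QUOTE
        return state, True
    if ch == "'":                            # step 4
        return state | SINGLE_QUOTE, True
    if ch == '"':                            # step 5
        return state | DOUBLE_QUOTE, True
    if ch == "!":                            # step 6
        return state | COMMENT, True
    return state, False                      # step 7 would take over here
```

Note that an opening quote mark is scanned into a state with the quote bit set, while the closing quote mark clears it, so the two ends of a string carry different states.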
-
- The "inner finite state machine" adjusts only the inner state, and
- always preserves the outer state. It not only changes an old inner
- state to a new inner state, but sometimes returns a "terminal" flag
- to signal that something interesting has been found.
-
-   State    Condition              Go to state    Return terminal-flag?
-   0        if "-"                 1
-            if "*"                 3              yes
-            if space, "#",
-              newline              0
-            if "_"                 0x100
-            if "w"                 0x101
-            if "h"                 0x111
-            if "c"                 0x121
-            other letters          0x100
-            otherwise              0xFF
-   1        if ">"                 2              yes
-            otherwise              0xFF
-   2        always                 0
-   3        always                 0
-   0xFF     if space,
-              newline              0
-            otherwise              0xFF
-
-   all 0x100+ states:
-            if not alphanumeric, add
-              0x8000 to the state                 yes
-            then for the following states:
-   0x101    if "i"                 0x202
-            otherwise              0x200
-   0x202    if "t"                 0x303
-            otherwise              0x300
-   0x303    if "h"                 0x404
-            otherwise              0x400
-   0x111    if "a"                 0x212
-            otherwise              0x200
-   0x212    if "s"                 0x313
-            otherwise              0x300
-   0x121    if "l"                 0x222
-            otherwise              0x200
-   0x222    if "a"                 0x323
-            otherwise              0x300
-   0x323    if "s"                 0x424
-            otherwise              0x400
-   0x424    if "s"                 0x525
-            otherwise              0x500
-            but for all other 0x100+ states:
-            if alphanumeric, add
-              0x100 to the state
-
-   0x8000+  always                 0
-
- (Note that if your text editor stores tabs as characters in their own
- right (usually 0x09) rather than rows of spaces, tab should be included
- with space and newline in the above.)
-
- Briefly, the finite state machine can be left running until it returns
- a terminal, which means it has found "->", "*" or a completed Inform
- identifier: and it detects "with", "has" and "class" as special keywords
- amongst these identifiers.
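The transition table can be transcribed fairly directly into Python. The sketch below is a literal reading of the table, not ZapInform's code. It makes two assumptions that the table does not state: "alphanumeric" is taken to include the underscore (as Inform identifiers do), and the keyword letters are matched in lower case only, exactly as the table lists them. The names inner_fsm, KEYWORD_STEPS and run_until_terminal are hypothetical.

```python
def is_ident_char(ch):
    # Assumption: "_" counts as alphanumeric inside an identifier.
    return ch.isascii() and (ch.isalnum() or ch == "_")

# The transitions that keep a with/has/class match alive.
KEYWORD_STEPS = {
    (0x101, "i"): 0x202, (0x202, "t"): 0x303, (0x303, "h"): 0x404,
    (0x111, "a"): 0x212, (0x212, "s"): 0x313,
    (0x121, "l"): 0x222, (0x222, "a"): 0x323,
    (0x323, "s"): 0x424, (0x424, "s"): 0x525,
}
KEYWORD_PREFIXES = {old for old, _ in KEYWORD_STEPS}

def inner_fsm(inner, ch):
    """Return (new_inner_state, terminal_flag) for one character."""
    if inner >= 0x8000 or inner in (2, 3):
        return 0, False                       # the "always 0" rows
    if inner == 0:
        if ch == "-":
            return 1, False
        if ch == "*":
            return 3, True
        if ch in " #\n":
            return 0, False
        if ch == "_":
            return 0x100, False
        if ch in "whc":
            return {"w": 0x101, "h": 0x111, "c": 0x121}[ch], False
        if ch.isascii() and ch.isalpha():
            return 0x100, False
        return 0xFF, False
    if inner == 1:
        return (2, True) if ch == ">" else (0xFF, False)
    if inner == 0xFF:
        return (0, False) if ch in " \n" else (0xFF, False)
    # From here on, inner is an identifier state (0x100 and up).
    if not is_ident_char(ch):
        return inner + 0x8000, True           # identifier complete
    if (inner, ch) in KEYWORD_STEPS:
        return KEYWORD_STEPS[(inner, ch)], False
    if inner in KEYWORD_PREFIXES:
        return (inner & 0xFF00) + 0x100, False   # keyword match lost
    return inner + 0x100, False               # one more identifier character

def run_until_terminal(text, inner=0):
    """Feed characters until a terminal is signalled; return that state."""
    for ch in text:
        inner, terminal = inner_fsm(inner, ch)
        if terminal:
            return inner
    return None
```

Feeding it "with " ends in 0x8404, that is, the "with" state 0x404 plus the 0x8000 completion marker, whereas "wither " ends in 0x8604, an ordinary six-letter identifier.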
-
- (c) Initial colouring
-
- ZapInform colours one line of visible text at a time. For instance, it
- might be faced with this:
-
- Object -> bottle "~Heinz~ bottle"
-
- And it outputs an array of colours for each character position in the
- line, which the text editor can then use in actually displaying the text.
-
- It works out the state before the first character of the line (the "O"),
- then scans through the line. For each character, it determines the
- initial colour as a function of the state at that character:
-
-     If single-quote or double-quote is set, then quoted text colour.
-     If comment is set, then comment colour.
-     If statement is set:
-         Use code colour
-             unless the character is "[" or "]", in which case use
-                 function colour,
-             or is a single or double quote, in which case use quoted text
-                 colour.
-     If not:
-         Use foreground colour
-             unless the character is "," or ";" or "*" or ">", in which
-                 case use directive colour,
-             or the character is "[" or "]", in which case use
-                 function colour,
-             or is a single or double quote, in which case use quoted text
-                 colour.
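These rules amount to a small pure function of the state and the character. In the Python sketch below the flag values are hypothetical (the same stand-ins one might use in any toy implementation) and colours are represented by their names from the text; the order of the tests follows the order of the rules.

```python
# Hypothetical flag values, chosen only for this example.
COMMENT      = 0x80000000
SINGLE_QUOTE = 0x40000000
DOUBLE_QUOTE = 0x20000000
STATEMENT    = 0x10000000

def initial_colour(state, ch):
    """Return the name of the initial colour for one character."""
    if state & (SINGLE_QUOTE | DOUBLE_QUOTE):
        return "quoted text"
    if state & COMMENT:
        return "comment"
    # The bracket and quote exceptions are the same in both contexts.
    if ch in ("[", "]"):
        return "function"
    if ch in ("'", '"'):
        return "quoted text"
    if state & STATEMENT:
        return "code"
    if ch in (",", ";", "*", ">"):
        return "directive"
    return "foreground"
```

The quote-mark exception matters for a closing quote, whose state no longer has the quote bit set but which should still be shown in quoted text colour.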
-
- However, the scanning algorithm sometimes signals that a block of
- text must be "backtracked" through and recoloured. For instance,
- this happens if the white space after the sequence "c", "l", "a",
- "s" and "s" is detected when in a context where the keyword "class"
- is legal. The scanning algorithm does this by setting the "colour
- backtrack" bit in the outer state. Note that the number of characters
- we need to recolour backwards from the current position has been
- recorded in bits 9 to 16 of the inner state (which has been counting
- up lengths of identifiers), while the scanning algorithm has also
- recorded the colour to be used. For instance, in
-
- Object -> bottle "~Heinz~ bottle"
-       ^  ^      ^
-
- backtracks of size 6, 2 and 6 are called for at the three marked
- spaces. Note that a backtrack never crosses a new-line.
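A backtrack over a completed identifier might be applied as in the Python sketch below. It assumes that colours is a list holding one colour per character of the current line, and that pos is the index of the character at which the terminal was signalled; both names, and the function itself, are hypothetical. Only the identifier case is covered, with the length read from the upper byte of the inner state (ignoring the 0x8000 completion marker).

```python
def apply_backtrack(colours, pos, inner, colour):
    """Recolour the identifier that ended just before index `pos`."""
    length = (inner >> 8) & 0x7F      # identifier length from the inner state
    start = max(0, pos - length)      # never reach back past the line start
    for i in range(start, pos):
        colours[i] = colour
    return colours
```

For the line above, the terminal at the space after "Object" arrives with inner state 0x8600 at index 6, so the six characters before it are recoloured.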
-
- ZapInform uses the following chart of colours:
-
-     name               default actual colour
-
-     foreground         navy blue
-     quoted text        grey
-     comment            light green
-     directive          black
-     property           red
-     function           red
-     code               navy blue
-     codealpha          dark green
-     assembly           gold
-     escape character   red
-
- but note that at this stage, we've only used the following:
-
-     function colour      [ and ] as function brackets, plus function names
-     comment colour       comments
-     directive colour     initial directive keywords, plus "*",
-                          "->", "with", "has" and "class" when used
-                          in a directive context
-     quoted text colour   singly- or doubly-quoted text
-     foreground colour    code in directives
-     code colour          code in statements
-     property colour      property, attribute and class names when
-                          used within "with", "has" and "class"
-
- For instance,
-
- Object -> bottle "~Heinz~ bottle"
-
- would give us the array
-
- DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQ
-
- (D being directive colour, F foreground colour and Q quoted text
- colour; it doesn't really matter what colour values the spaces have).
-
- (d) Colour refinement
-
- The next operation is "colour refinement", which includes a number
- of things.
-
- Firstly, any characters with colour Q (quoted-text) which have special
- meanings are given "escape-character colour" instead. This applies
- to "~", "^", "\" and "@" followed by (possibly) another "@" and a
- number of digits.
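One way to sketch this first refinement in Python is with a regular expression that matches the special characters, applied only where the initial colouring gave quoted text colour. The function name and the single-letter colour codes (Q for quoted text, E for escape character) are conventions of this sketch; the pattern follows the description above and nothing more.

```python
import re

# "~", "^" or "\", or else "@" optionally followed by a second "@"
# and any number of digits.
ESCAPE = re.compile(r"[~^\\]|@@?[0-9]*")

def refine_escapes(line, colours):
    """Return the colour string with escapes in quoted text marked E."""
    out = list(colours)
    for match in ESCAPE.finditer(line):
        for i in range(match.start(), match.end()):
            if colours[i] == "Q":     # only recolour inside quoted text
                out[i] = "E"
    return "".join(out)
```

Applied to a line such as Object -> bottle "bottle marked ~DRINK ME~", only the two tildes change colour.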
-
- Next we look for identifiers. An identifier for these purposes includes
- a number, for it is just a sequence of:
-
- "_" or "$" or "#" or "0" to "9" or "a" to "z" or "A" to "Z".
-
- The initial colouring of an identifier tells us its context. We're
- only interested in those in foreground colour (these must be used
- in the body of a directive) or code colour (used in statements).
-
- If an identifier is in code colour, then:
-
- If it follows an "@", recolour the "@" and the identifier in
- assembly-language colour.
- Otherwise, unless it is one of the following:
-
- "box" "break" "child" "children" "continue" "default"
- "do" "elder" "eldest" "else" "false" "font" "for" "give"
- "has" "hasnt" "if" "in" "indirect" "inversion" "jump"
- "metaclass" "move" "new_line" "nothing" "notin" "objectloop"
- "ofclass" "or" "parent" "print" "print_ret" "provides" "quit"
- "random" "read" "remove" "restore" "return" "rfalse" "rtrue"
- "save" "sibling" "spaces" "string" "style" "switch" "to"
- "true" "until" "while" "younger" "youngest"
-
- we recolour the identifier to "codealpha colour".
-
- On the other hand, if an identifier is in foreground colour, then we
- check it to see if it's one of the following interesting keywords:
-
- "first" "last" "meta" "only" "private" "replace" "reverse"
- "string" "table"
-
- If it is, we recolour it in directive colour.
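The identifier rules can be sketched in Python as follows, using single-letter colour codes (S for code, F for foreground, I for codealpha, A for assembly, D for directive). The function name is hypothetical, and so are two decisions the text leaves open: the colour of an identifier's first character is taken as its context, and keywords are compared exactly as written, without folding case.

```python
import re

IDENTIFIER = re.compile(r"[_$#0-9a-zA-Z]+")

STATEMENT_KEYWORDS = frozenset("""
    box break child children continue default do elder eldest else false
    font for give has hasnt if in indirect inversion jump metaclass move
    new_line nothing notin objectloop ofclass or parent print print_ret
    provides quit random read remove restore return rfalse rtrue save
    sibling spaces string style switch to true until while younger youngest
""".split())

DIRECTIVE_KEYWORDS = frozenset(
    "first last meta only private replace reverse string table".split())

def refine_identifiers(line, colours):
    """Return the colour string after recolouring identifiers."""
    out = list(colours)
    for match in IDENTIFIER.finditer(line):
        start, end = match.span()
        context = colours[start]      # assumption: first character decides
        if context == "S":
            if start > 0 and line[start - 1] == "@":
                # Assembly opcode: recolour the "@" as well.
                out[start - 1:end] = "A" * (end - start + 1)
            elif match.group() not in STATEMENT_KEYWORDS:
                out[start:end] = "I" * (end - start)
        elif context == "F" and match.group() in DIRECTIVE_KEYWORDS:
            out[start:end] = "D" * (end - start)
    return "".join(out)
```

Because digits and "#" count as identifier characters here, a bare numeral such as 1 ends up in codealpha colour, and so does a token like ##Examine.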
-
- Thus, after colour refinement we arrive at the final colour scheme:
-
-     function colour      [ and ] as function brackets, plus function names
-     comment colour       comments
-     quoted text colour   singly- or doubly-quoted text
-     directive colour     initial directive keywords, plus "*",
-                          "->", "with", "has" and "class" when used
-                          in a directive context, plus any of the
-                          reserved directive keywords listed above
-     property colour      property, attribute and class names when
-                          used within "with", "has" and "class"
-     foreground colour    everything else in directives
-     code colour          operators, numerals, brackets and statement
-                          keywords such as "if" or "else" occurring
-                          inside routines
-     codealpha colour     variable and constant names occurring inside
-                          routines
-     assembly colour      @ plus assembly language opcodes
-     escape char colour   special or escape characters in quoted text
-
- (e) An example
-
- Consider the following example stretch of code (which is not meant to
- be functional or interesting, just colourful):
-
- ! Here's the bottle:
-
- Object -> bottle "bottle marked ~DRINK ME~"
- with name "bottle" "jar" "flask",
- initial "There is an empty bottle here.",
- before
- [; LetGo: ! For dealing with water
- if (noun in bottle)
- "You're holding that already (in the bottle).";
- ],
- has container;
-
- [ ReadableSpell i j k;
- if (scope_stage==1)
- { if (action_to_be==##Examine) rfalse;
- rtrue;
- }
- @set_cursor 1 1;
- ];
-
- Extend "examine" first
- * scope=ReadableSpell -> Examine;
-
- Here are the initial colourings:
-
- ! Here's the bottle:
- CCCCCCCCCCCCCCCCCCCC
-
- Object -> bottle "bottle marked ~DRINK ME~"
- DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQQQQQQQQQQQQ
- with name "bottle" "jar" "flask",
- FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
- initial "There is an empty bottle here.",
- FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
- before
- FFFFFFFPPPPPP
- [; LetGo: ! For dealing with water
- FFFFFFFfSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
- if (noun in bottle)
- SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
- "You're holding that already (in the bottle).";
- SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
- ],
- SSSSSSSfD
- has container;
- FFDDDDDPPPPPPPPPD
-
- [ ReadableSpell i j k;
- fffffffffffffffSSSSSSS
- if (scope_stage==1)
- SSSSSSSSSSSSSSSSSSSSS
- { if (action_to_be==##Examine) rfalse;
- SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
- rtrue;
- SSSSSSSSSSSS
- }
- SSS
- @set_cursor 1 1;
- SSSSSSSSSSSSSSSSSS
- ];
- fD
-
- Extend "examine" first
- DDDDDDDQQQQQQQQQFFFFFF
- * scope=ReadableSpell -> Examine;
- FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD
-
- (Here F=foreground, D=directive, f=function, S=code (S for
- "statement"), C=comment, P=property, Q=quoted text.) And here is
- the refinement:
-
- ! Here's the bottle:
- CCCCCCCCCCCCCCCCCCCC
-
- Object -> bottle "bottle marked ~DRINK ME~"
- DDDDDDDDDDFFFFFFFQQQQQQQQQQQQQQQEQQQQQQQQEQ
- with name "bottle" "jar" "flask",
- FFDDDDDPPPPPQQQQQQQQFQQQQQFQQQQQQQD
- initial "There is an empty bottle here.",
- FFFFFFFPPPPPPPPQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQD
- before
- FFFFFFFPPPPPP
- [; LetGo: ! For dealing with water
- FFFFFFFfSSIIIIISSSSSSSSSSSSSSSSSSSSSSSCCCCCCCCCCCCCCCCCCCCCCCC
- if (noun in bottle)
- SSSSSSSSSSSSSSSSSIIIISSSSIIIIIIS
- "You're holding that already (in the bottle).";
- SSSSSSSSSSSSSSSSSQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQS
- ],
- SSSSSSSfD
- has container;
- FFDDDDDPPPPPPPPPD
-
- [ ReadableSpell i j k;
- fffffffffffffffSSSSSSS
- if (scope_stage==1)
- SSSSSSIIIIIIIIIIISSIS
- { if (action_to_be==##Examine) rfalse;
- SSSSSSSSSSIIIIIIIIIIIISSIIIIIIIIISSSSSSSSS
- rtrue;
- SSSSSSSSSSSS
- }
- SSS
- @set_cursor 1 1;
- SSAAAAAAAAAAASISIS
- ];
- fD
-
- Extend "examine" first
- DDDDDDDQQQQQQQQQFDDDDD
- * scope=ReadableSpell -> Examine;
- FFFFFFFFFFFFFFFFDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDDDFFFFFFFD
-
- (where E = escape characters, A = assembly and I = "codealpha", that
- is, identifiers cited in statement code).
-
- --
- Graham Nelson | graham@gnelson.demon.co.uk | Oxford, United Kingdom
-